Problem description
Develop a classification model from input images for training and testing: 3000 images in total, amounting to about 1 billion data points.
Simple solution
A custom model performs as well as the best model, with only two clearly interpretable parameters.
More sophisticated solutions
We can use any one of a set of commonly used supervised machine learning algorithms for classification.
Lessons learned
Steps:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from mpl_toolkits import mplot3d
from imghdr import what
import numpy as np
import pandas as pd
import seaborn as sns
from plotnine import * # let's take full advantage of the grammar of plots
# from dfply import * # we could also take advantage of the grammar of data wrangling
import statsmodels as sm
from scipy import optimize
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA # Linear Discriminant
from sklearn.decomposition import PCA
# SVM
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# using PyTorch for deep nnets
#import torch
#import torch.nn as nn
#import torch.nn.functional as F
#import torch.optim as optim
#from torch.optim.lr_scheduler import StepLR
#from torchvision.io import read_image
#from torchvision import transforms #, datasets
import os
import shutil
import itertools
import json # to serialize intermediate model state and results
# Metadata saved to output plots
METADATA = {
"Author": "Hörmet Yiltiz",
"License": "CC-BY-NC-ND 4.0",
"Copyright": "Hörmet Yiltiz <hyiltiz@gmail.com>, 2021"
}
# Show plots inline
%matplotlib inline
The shape of the intermediate data structure:
[
{ -- One list element per image file
type: file type
name: file name
im: 3D image (in RGB)
},
...
]
# Read Images
def load_img(impath :str = './squares/train/a/'):
"""
Reads all images inside a folder and returns them in a list of dictionary
The returned data has the following structure:
[
{ -- One list element per image file
type: file type
name: file name
im: 3D image (in RGB)
}
]
"""
files = os.listdir(impath)
result = []
for ximg in files:
fname = f'{impath}{ximg}'
if what(fname): # exclude any non-image files if they exist
cell = dict()
cell['type'] = what(fname)
cell['im'] = mpimg.imread(fname)
cell['name'] = fname
result.append(cell)
return result
# this is the basic idea for the function above
img = mpimg.imread('./squares/train/a/0.jpg')
classes = ['a', 'b', 'c']
train = dict()
for label in classes:
train[label] = load_img(impath = f'./squares/train/{label}/')
# the test set for model evaluation and validation
# test = dict()
# for label in classes:
# test[label] = load_img(impath = f'./squares/val/{label}/')
# or using dict comprehension for a more concise code
test = {label: load_img(impath = f'./squares/val/{label}/') for label in classes}
Let's take a quick look at the data by simply showing a random sample for each class.
### let's take a quick look at the data
n_sample = 5
def plot_raw(data: dict, n_sample: int = 15, classes: list = classes):
"""
Creates a simple gallery plot based on a random sample
"""
cols, rows = n_sample, len(classes)
figure = plt.figure(figsize=(cols, rows))
    for i, (irow, icol) in enumerate(itertools.product(range(rows), range(cols))):
        ilabel = classes[irow]
        sample_idx = np.random.randint(len(data[ilabel]))
        img = data[ilabel][sample_idx]['im']
        figure.add_subplot(rows, cols, i+1)
        # plt.title(ilabel)
        if icol == 0:
plt.text(0, 220, ilabel,
horizontalalignment='left',
verticalalignment='center')
plt.axis("off")
plt.imshow(img)
return figure
figure = plot_raw(train)
figure.suptitle('Train samples')
plt.show(figure)
figure = plot_raw(test)
figure.suptitle('Test samples')
plt.show(figure)
Each row shows samples for one class. We have three classes: a) dark luminance, b) mid luminance (saturated colors), and c) high luminance (pastel). Each pixel contains 3 values (an RGB triplet) and each image contains tens of thousands of pixels. We can see that most of the pixels are redundant.
Non-redundant information:
ximg = train['a'][3]['im']
def stats_rgb(img):
"""
return the unique rgb triplets present in an image
"""
    rgb = img.reshape(img.size//3, 3)  # use the img argument, not the global ximg
# plt.imshow(ximg)
# print(ximg.shape)
full_set = np.unique(rgb, axis=0)
white = np.array([1.,1.,1.])
    if all(full_set[-1, ] == white):  # if the last element is white
        colors = full_set[0:-1]  # then do not include it
    else:
        colors = full_set  # no white background found; keep everything
    return colors
def stats_rgb_fast(img):
"""
Simply take the RGB from the central pixel. Upon
initial analysis, we can confirm that all color
patches are uniform, thus no need to actually
reshape, and find the unique RGB triplets.
"""
# just return the central RGB triplet
return img[img.shape[0]//2, img.shape[1]//2, :]
def stats_size(img):
"""
return the length of a side of the colored square
"""
gray = img.sum(axis=2)
midrow = gray[gray.shape[0]//2, :]
side = np.count_nonzero(midrow - gray[0]) # ignoring whites
return side
def stats_size_fast(img):
"""
    Only use the midrow and do fewer boolean operations.
"""
midrow = img[img.shape[0]//2, :, :]
return sum(midrow.sum(axis=1) < midrow[0].sum()) # white pixels start at edge
def reduce_dimensions(cell, label, fast :bool = True):
"""
Manually reduce dimensionality from images to unique rgb colors
since image has no shape or content except a square color patch
then return a row of a data frame
"""
img = cell['im']
if fast:
# ASSUMPTION: uniform color patch
rgb = stats_rgb_fast(img)
side = stats_size_fast(img)
n_colors = 1 # WRONG if above assumption is violated
else:
rgb = stats_rgb(img)[0] # ASSUMPTION: uniform color patch
side = stats_size(img)
n_colors = rgb.shape[0] # can tell us if assumption is violated
row = {
'name': cell['name'],
'label': label,
'type': cell['type'],
'side': side,
'height': img.shape[0],
'width': img.shape[1],
'r': rgb[0],
'g': rgb[1],
'b': rgb[2],
'n_colors': n_colors
}
return row
def data2df(data):
"""
Create a data frame by reducing the image dimension into the RGB color
of its central color patch, patch size and other metadata. Return a
DataFrame with one row per image.
"""
tbl = []
    for label in list(data.keys()):  # iterate over the data argument, not the global
tbl += [reduce_dimensions(cell, label) for cell in data[label]]
return pd.DataFrame(tbl)
# Let's check the assumption we made earlier to see if we need to
# adjust the code
# do we only have a single uniform color for all images?
n_colors = list(map(lambda x: stats_rgb(x['im']).shape[0], train['a']))
assert(all(np.array(n_colors)==1)) # Yes!
label = 'a'
train_tbl = []
%timeit [reduce_dimensions(cell, label, False) for cell in train[label][0:10]]
Running it for the whole dataset the first time took about 15 minutes in total! The measurement above predicted as much: each image took about 150 ms.
We can profile the dimensionality reduction function to see what's taking so long.
# %timeit [reduce_dimensions(x, 'a') for x in train['a'][0:10]]
import cProfile
cProfile.run("reduce_dimensions(train['a'][0], 'a', True)")
cProfile.run("reduce_dimensions(train['a'][0], 'a', False)")
Computing the RGB compositions was costly!
Because it essentially sorts a 4D array (one dimension indexing the images, three dimensions of RGB image data).
We only have to check it once though. We discovered:
We can use these discoveries to come up with a faster dimensionality reduction algorithm. The fast algorithm:
As a result, the _fast version of our functions took 1 ms.
We gained 300x speed boost!
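A rough stand-alone illustration of where the time goes (a sketch on a single random image of an assumed 224×224 size, not the original profiling run): the slow path sorts every RGB triplet with `np.unique`, while the fast path reads one pixel.

```python
import numpy as np
import timeit

# random stand-in image; ~50k RGB triplets
img = np.random.default_rng(0).random((224, 224, 3))

slow = lambda: np.unique(img.reshape(-1, 3), axis=0)       # sort all triplets
fast = lambda: img[img.shape[0] // 2, img.shape[1] // 2, :]  # read one pixel

t_slow = timeit.timeit(slow, number=10) / 10
t_fast = timeit.timeit(fast, number=10) / 10
print(f"slow: {t_slow * 1e3:.2f} ms per call, fast: {t_fast * 1e6:.2f} µs per call")
```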
The original dimensionality reduction procedure took a while to compute. (Fortunately, we were able to make it much faster; we may not always be lucky with other data sets / problems.)
We can use this for all future analysis, without having to do any more costly disk operations.
Let's serialize intermediate data and store it as .csv file. While we are on it, create our DataFrame.
# now let's convert from our custom data structure into a pandas dataframe
train_df = data2df(train)
test_df = data2df(test)
# store it for later use so we do not need to redo the image loading
# and dimensionality reduction
train_df.to_csv('train.csv')
test_df.to_csv('test.csv')
# let's quickly look at what we have
display(train_df)
# and its descriptives
train_df.describe()
n_colors, when computed with the slow dimensionality reduction, has a std of 0. Therefore, each patch has a single uniform color.
The RGB dimensions each have the same std, and R and G have the same mean and std. Therefore, the color patches were generated in RGB space (rather than HSV, Lab space, etc.).
height and width have exactly the same statistics. Therefore, the colored patches appear to be squares.
Those are aggregate statistics; do the conclusions hold for each individual image?
# Are any of them not squares?
any(train_df.height - train_df.width)
They are all squares!
print(ggplot(train_df) + aes(x='height', color='label') +
geom_density())
print(ggplot(train_df) + aes(x='side', color='label') +
geom_density())
print(ggplot(train_df) + aes(x='r', y='g', color='label') +
geom_point() + coord_equal())
Color channel r seems identical to color channel g. We saw this earlier from the statistics; now we can see it for each image.
The color channels r and g seem somewhat informative to tell apart the class labels.
print(ggplot(train_df) + aes(x='r', y='b', color='label') +
geom_point() + coord_equal())
The b channel seems very informative for telling apart the class labels.
(Figure panels: blue channel for class a, class b, and class c.)
print(ggplot(train_df) + aes(x='g', y='b', color='label') +
      geom_point())
Same as before. Not surprising, since g = r.
print(ggplot(train_df) + aes(x='type', y='label') +
geom_bin2d())
Class label is uniformly distributed for file type. $\iff$
File type is totally uninformative. $\iff$
Knowing the file type doesn't tell us anything about the label.
Also, this must be a toy data set. Its generative model, in pseudocode:
for filetype in ['bmp', 'jpeg', 'png']:
    for class in [0, 1, 2]:
        generate samples
        save images
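The pseudocode above can be sketched as a concrete generator. This is a hedged reconstruction: the per-class luminance bands, the image size, and the patch-size range are my assumptions, not recovered from the data (only the class ordering dark/mid/pastel comes from the earlier inspection).

```python
import numpy as np

def make_square_image(label: int, size: int = 224, rng=None):
    """Generate one toy sample: a uniform colored square on a white canvas.

    ASSUMED generative rules (not from the original data set):
    class 0 -> dark colors, class 1 -> mid luminance, class 2 -> pastel.
    """
    rng = np.random.default_rng() if rng is None else rng
    luminance_ranges = {0: (0.0, 0.35), 1: (0.35, 0.65), 2: (0.65, 1.0)}
    lo, hi = luminance_ranges[label]
    rgb = rng.uniform(lo, hi, size=3)            # one uniform patch color
    side = rng.integers(size // 4, size // 2)    # patch side length
    img = np.ones((size, size, 3))               # white canvas
    top = (size - side) // 2                     # center the patch
    img[top:top + side, top:top + side, :] = rgb
    return img

img = make_square_image(label=0)
```

In a full generator, the outer loops would then save `img` in each of the bmp/jpeg/png formats.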
def plot_3d_interactive(df = train_df):
"""
Plot interactive 3D scatter plot over the RGB space, and show metadata.
"""
import plotly as py
import plotly.graph_objs as go
label2color = {
'a': 'red',
'b': 'green',
'c': 'blue'
}
trace = go.Scatter3d(
x=df['r'],
y=df['g'],
z=df['b'],
mode='markers',
marker=dict(
size=3,
color=df['label'].map(label2color)
),
name= '3d',
# list comprehension to add text on hover
text= [f"type: {itype}<br>side: {iside}<br>width: {iwidth}"
for itype,iside,iwidth
in list(zip(df['type'],
df['side'],
df['width']))],
# suppress hover
hoverinfo='text'
)
layout = dict(title = 'Classes in RGB space',
scene=dict(
xaxis={'title': 'r'},
yaxis={'title': 'g'},
zaxis={'title': 'b'}
)
)
data = [trace]
fig = dict(data=data, layout=layout)
py.offline.iplot(fig, filename = '3D.html')
plot_3d_interactive(train_df)
What's up with the dot at the corner around (0, 0, 0)?
Let's ignore or just exclude them. Should we?
# png format stores RGB in [0,1] while others store in 0..255
extremes = (train_df.r < 1) & (train_df.g < 1) & (train_df.b < 1)
train_df[extremes]
The extreme values all come from .png files; that cannot be a coincidence. In fact, none of the "non-extreme" values are .png either.
Actually, the .png format stores RGB as floating point numbers in $[0, 1]$, while the others store unsigned 8-bit integers in $\{0, 1, ..., 255\}$.
Let's scale it.
# let's bring all the color space within [0,1]
def scale_colorspace(df):
"""
scale into RGB represented in [0, 1]
"""
# While floats [0, 1] takes more CPU to compute,
# model coefficients will be easier to interpret
df_scaled = df.copy()
    # scale only the images stored as 0..255 integers; per the analysis above,
    # exactly the non-png files use that range (checking each channel >= 1
    # would miss dark 0..255 pixels)
    rows_to_scale = df.type != 'png'
    df_scaled.loc[rows_to_scale, ('r', 'g', 'b')] = df.loc[rows_to_scale, ('r', 'g', 'b')]/255
# do it during pre-processing to avoid over-computation
df_scaled['labeln'] = df_scaled['label'].map({'a': 0, 'b': 1, 'c': 2})
return df_scaled
train_scaled = scale_colorspace(train_df)
test_scaled = scale_colorspace(test_df)
# now let's replot the scatter plot
plot_3d_interactive(train_scaled)
train_scaled
Extreme values no more!
We saw blue channel was most informative, while the rest aren't much. But by how much? How can we quantify it?
We can use mutual information between two discrete processes X and Y. If knowing X tells us a lot about what Y is, then their mutual information will be high.
Note: Mutual information is a generalization of statistical correlation between arbitrary processes. Pearson correlation assumes linearity and Spearman correlation assumes monotonicity; mutual information assumes neither.
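A minimal worked illustration of the definition, independent of the library used below (`mutual_information_bits` is a hypothetical helper written just for this example):

```python
import numpy as np

def mutual_information_bits(x, y):
    """I(X;Y) in bits, estimated from paired discrete samples."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))  # joint probability
            if pxy > 0:                           # 0 * log(0/...) := 0
                px, py = np.mean(x == xv), np.mean(y == yv)
                mi += pxy * np.log2(pxy / (px * py))
    return mi

# X fully determines a fair binary Y -> I(X;Y) = H(Y) = 1 bit
print(mutual_information_bits([0, 0, 1, 1], [5, 5, 9, 9]))  # 1.0
# X independent of Y -> I(X;Y) = 0 bits
print(mutual_information_bits([0, 1, 0, 1], [5, 5, 9, 9]))  # 0.0
```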
Using my own library for discrete state space information-theoretic analysis: https://github.com/hyiltiz/info
# blue value seems mostly informative
# by how much?
# Import my information theory module
# https://github.com/hyiltiz/info
# Here including the relevant sections lines of the code for completeness
def compute_mutual_information(X, Y, domain):
observed = list(zip(X, Y))
N = len(observed)
    # domain gives inclusive (min, max) bounds per variable; do not
    # overwrite the caller-supplied domain, and include the upper bound
    grid = itertools.product(range(domain[0][0], domain[0][1] + 1),
                             range(domain[1][0], domain[1][1] + 1))
    # compute the joint frequency table
    observed_counts = [observed.count(coord) for coord in grid]
    nX = domain[0][1] - domain[0][0] + 1
    nY = domain[1][1] - domain[1][0] + 1
    cXY = np.array(observed_counts).reshape(nX, nY)
# compute the marginal counts
[cX, cY] = np.meshgrid(cXY.sum(axis=0), cXY.sum(axis=1))
# now normalize into probability mass functions
[pXY, pX, pY] = [cXY/N, cX/N, cY/N]
pXpY = pX * pY
# get rid of the zero events (only in the joint is sufficient)
# 0 * log(0/[possibly zero]) := 0 for information
# this limit actually comes from the expectation definition of information
nonzeros = pXY != 0
nonzeros_X = pX[0, :] != 0
nonzeros_Y = pY[:, 0] != 0
# -inf < pmi < min(-log(pX), -log(pY))
pointwise_mutual_information = np.log2(pXY[nonzeros]/pXpY[nonzeros])
# the expected value of the pointwise MI
mutual_XY = sum(pXY[nonzeros] * pointwise_mutual_information)
# now compute other relevant information metrics
# the self-information
self_information_X = -np.log2(pX[0, :])
# entropy := expected self information
HX = sum(pX[0, :][nonzeros_X] * self_information_X[nonzeros_X])
# print(f"====\nmutual_XY:{mutual_XY}, "
# f"normalied by entropy(X): {mutual_XY/HX}\n, "
# f"{out['mutual-information']}")
# import ipdb; ipdb.set_trace() # BREAKPOIN for debugging
return (mutual_XY, mutual_XY/HX)
def analyze_mi(data = train_scaled):
"""Perform information theoretic analysis between target class labels
and predictors. Computae mutual information between class labels (discrete)
and discrete predictors (or binned), using custom implemented function.
Mutual information can be viewed as a generalized measure of correlation,
between arbitrary processes, whereas paerson correlation assumes linearity,
and spearman correlation assumes monotonocity. Normalized mutual information
tells us how much percentage of the information required for calssification
is contained in the predictor. For example, if we have 3 classes, we need
log2(3) bits of information from a sample to correctly classify; normalization
simplify scales mutual by this factor.
"""
# the algorithm is more robust for discrete samples
train255 = train_scaled.copy()
train255.loc[:, ('r', 'g', 'b')] = train255.loc[:, ('r', 'g', 'b')]*255
train255 = train255.astype({
'r': 'int',
'g': 'int',
'b': 'int',
'height': 'int'
})
train255['labeln'] = train255['label'].map(ord) - 97
train255['typen'] = train255['type'].map({
'png': 0,
'jpeg': 1,
'bmp': 2})
domain_color = ((0, 255), (0, 2))
# first prepare all the input predictors as a list
# then submit for computation
# the computation itself can be parallel and distributed
kwargs = [
(train255.r, train255.labeln, domain_color),
(train255.g, train255.labeln, domain_color),
(train255.b, train255.labeln, domain_color),
(train255.side, train255.labeln, ((0,train255.side.max()), (0,2))),
(train255.typen, train255.labeln, ((0,train255.typen.max()), (0,2)))
]
rows = ['red', 'green', 'blue', 'side', 'filetype']
columns = ['mutual info (bits)', 'normalized (%)'] # of log2(3) where 3 is #class
mi_tbl = list(map(lambda x: compute_mutual_information(*x), kwargs))
mi_df = pd.DataFrame(mi_tbl, columns = columns)
mi_df['predictor'] = rows
return(mi_df)
# information theoretic analysis for the train set
mi_df = analyze_mi(train_scaled)
display(mi_df)
# first reshape it into the long format
mi_long = mi_df.melt('predictor')
mi_long.rename(columns = {'value': 'information', 'variable': 'measure'}, inplace=True)
# rearrange the order
rows = ['red', 'green', 'blue', 'side', 'filetype']
mi_long['predictor'] = mi_long['predictor']\
.astype(pd.CategoricalDtype(ordered=True))\
.cat.reorder_categories(rows,
ordered=True)
Normalized mutual information tells us what percentage of the information required for classification is contained in the predictor.
For example, we have 3 classes, so we need log2(3) bits of information from a sample to classify it correctly. Normalization simply scales mutual information by this factor; therefore, it is a percentage.
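A quick arithmetic check of that normalization ceiling:

```python
import numpy as np

# With 3 equally likely classes, a perfect predictor must supply
# H(label) = log2(3) bits per sample; normalized MI divides by this ceiling.
print(np.log2(3))  # ≈ 1.585 bits
```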
print(ggplot(mi_long) +
aes(x='predictor', y='information', shape='measure', linetype='measure', group='measure') +
geom_line() + geom_point(size=6) +
theme_xkcd(base_size=20) +
ggtitle('If you are a fan of information theory and xkcd')
)
(Plot annotations: 97.4% of the information required for classification, almost perfectly informative; 8.7% informative, almost useless; 0% information, totally useless.)
Data description
We have a set of simple images and a few classes. The images are split into a training set (around 1500 images, 500 per class) and a test set (1500 images, 500 per class). The images come in several common binary image storage formats.
Problem description
We need to design, implement, train and validate a classification model.
Confirmed observations
We could fire at the problem with big cannons from supervised machine learning such as:
Or even unsupervised ones, such as k-means.
But we've seen that:
Channel r is identical to channel g, and channel b contains almost all the information. So we can do without any of those heavy models; just do a geometric projection and threshold it.
We can flatten the 3D dot cloud by rotating about the b axis by 45 degrees (because r = g describes a 45-degree line in the r-g plane). Rotation is a linear transformation, which can be represented as a matrix multiplication...
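As a quick sanity check (a sketch, not part of the original analysis): when r = g, scaling r alone by 1/cos(45°) gives the same value as a true projection onto the unit vector along the r-g diagonal.

```python
import numpy as np

angle = np.pi / 4
# unit vector along the r = g diagonal in the r-g plane
diag_unit = np.array([np.cos(angle), np.sin(angle), 0.0])
# shortcut: scale r alone by 1/cos(45°) = sqrt(2)
shortcut = np.array([1 / np.cos(angle), 0.0, 0.0])

rgb = np.array([0.3, 0.3, 0.8])  # any point with r == g
print(rgb @ diag_unit)  # projection onto the diagonal
print(rgb @ shortcut)   # same value, because r == g
```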
# Further decrease dimensions by project into 2D by rotation.
angle = np.pi/4
rg_diag_unit = [1/np.cos(angle), 0, 0]
# matrix product as a linear projection
train_scaled['rg_diag'] = train_scaled.loc[:, ('r', 'g', 'b')] @ rg_diag_unit
print(ggplot(train_scaled) +
aes(x='rg_diag', y='b', color='label') +
geom_point()
)
We further decreased the dimensions to only 2. Now the classifier seems obvious: just draw two horizontal lines (and use them as thresholds/boundaries for classification).
We can eyeball a guess from the plot.
thresholds_guess = [0.35, 0.65]
print(ggplot(train_scaled) +
aes(x='rg_diag', y='b', color='label') +
geom_point() +
geom_hline(yintercept=thresholds_guess[0], linetype='dashed', color='blue') +
geom_hline(yintercept=thresholds_guess[1], linetype='dashed', color='blue') +
ggtitle('Initial guess for our geometric classifiers\n(dashed blue lines represent classification thresholds)')
)
# try a guess
thresholds_init = [0.35, 0.65]
def blue_channel_classifier(data = train_scaled, thresholds = [0.35, 0.65]):
"""
Classification based on the blue channel. Compute error given two thresholds.
"""
# initialize to label b
class_pred = np.ones(data.b.shape)
# vectorized evaluation
class_pred[data.b < thresholds[0]] = 0 # class a
class_pred[data.b >= thresholds[1]] = 2 # class c
# not RMSE, just classification error, as percentage
error = np.mean(data.labeln != class_pred)
return (class_pred, error)
blue_channel_classifier(data = train_scaled, thresholds = thresholds_init)[1]
1.6% error for an informed guess; not bad! But how can we do better? Optimization. Specifically, we can systematically vary where to place the two thresholds; the best two horizontal lines should yield the smallest classification error.
Thus, we define our cost function as this classification error, given two thresholds.
The model is an optimization for the best two horizontal lines that minimizes misclassification error from the training data set.
def model_projection(train, test):
"""
Fit geometric model via optimization and report model performance.
"""
    cost_f = lambda x: blue_channel_classifier(train, x)[1]  # use the train argument, not the global
thresholds_init = [0.35, 0.65]
res = optimize.minimize(cost_f,
thresholds_init,
method='nelder-mead',
options={'disp': False},
                           bounds=optimize.Bounds([0, 0], [1, 1])  # (lower, upper) bounds
)
    print('After optimization, minimized training error to:', res.fun)
# with optimization, we reached 1.47% by only using two parameters
# to classify the original 1500 images
# that contained 1048703688 ~ 1 billion data points
# with just two interpretable parameters!
    test_pred, generalization_error = blue_channel_classifier(data = test, thresholds = res.x)
print('Generalization error computed from trained model on the test set is: ', generalization_error)
domain_label = ((0,2), (0,2))
mi, mi_scaled = compute_mutual_information(test_pred, test_scaled.labeln, domain_label)
return({
'model': 'projection-optimization',
'learning-error': res.fun,
'generalization-error': generalization_error,
'mutual-information': mi,
'mutual-information-normalized': mi_scaled,
'degree-of-freedom': len(res.x)
}, res)
ireport, res = model_projection(train_scaled, test_scaled)
report = []
report.append(ireport)
ireport
The optimization space plots training error as a function of the model parameters (the heights of the two horizontal cutoff lines). We see that:
def plot_projection_optimization(thresholds_init,
train_scaled,
res):
"""Plot the cost function over the parameters to be optimized"""
# Make a grid to evaluate the function (for plotting)
xg, yg = np.meshgrid(np.linspace(0, 1), np.linspace(0, 1))
# turn it into ternary function to vectorize over matrices
cost_f = lambda x: blue_channel_classifier(train_scaled, x)[1]
cost_fv = np.vectorize(lambda x, y: cost_f([x, y]))
cost_surface = cost_fv(xg, yg)
plt.figure()
plt.contourf(cost_surface, extent=[0, 1, 0, 1], origin="lower", cmap='gray_r', levels=50)
plt.plot(res.x[0], res.x[1], 'ro')
plt.plot(thresholds_init[0], thresholds_init[1], 'bo')
plt.title('Optimization space')
plt.xlabel('$threshold_1$')
plt.ylabel('$threshold_2$')
plt.legend(['Optimal thresholds', 'Initial guess'])
plt.colorbar()
# how easily could we get stuck in a local minimum?
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(xg, yg, cost_surface, rstride=1, cstride=1,
cmap='viridis', linewidth=0, antialiased=False)
ax.set_xlabel('$threshold_1$')
ax.set_ylabel('$threshold_2$')
ax.set_zlabel('Train error')
ax.set_title('Optimization space')
ax.azim = -30
ax.elev = 30
ax.dist = 10
plot_projection_optimization(thresholds_init = [0.35, 0.65],
train_scaled=train_scaled,
res=res)
(Results: 1.4% training error, and 2.1% generalization error for unseen/new data, computed using the test set; 90.8% informative for classifying an unseen sample.)
Our custom-built model had satisfactory performance. How does it compare to standard models for multi-class supervised learning? Let's try:
# let's try a better classifier with a non-linear kernel to make use
# of the rg_diag and the side of the color patch
def prep_inputs(train, test):
"""
Prepare the inputs for a more general training.
"""
predictors = ['rg_diag', 'b', 'side']
angle = np.pi/4
rg_diag_unit = [1/np.cos(angle), 0, 0]
    train['rg_diag'] = train.loc[:, ('r', 'g', 'b')] @ rg_diag_unit  # use the train argument
X = train.loc[:, predictors]
y = train.labeln
# prepare the test input
    test['rg_diag'] = test.loc[:, ('r', 'g', 'b')] @ rg_diag_unit  # use the test argument
X_test = test.loc[:, predictors]
y_test = test.labeln
return (
(X,y), # train set
(X_test, y_test) # test set
)
(X, y), (X_test, y_test) = prep_inputs(train_scaled, test_scaled)
def model_svd(data, n_components, model_name='LDA'):
"""Classify after doing a svd. Model name can be a LDA or PCA model.
"""
model_map = {
'LDA': (LDA, lambda m: m.coef_.size),
'PCA': (PCA, lambda x: 0)
}
model_f, df = model_map[model_name]
# data
(X, y), (X_test, y_test) = data
model = model_f(n_components=n_components)
if model_name == 'PCA':
X_r = model.fit(X).transform(X)
else:
X_r = model.fit(X, y).transform(X)
# print(X_r.shape)
# model performance profile
try:
test_pred = model.predict(X_test)
train_pred = model.predict(X)
except AttributeError as e:
test_pred = np.zeros(y_test.shape) + np.NaN
train_pred = np.zeros(y.shape) + np.NaN
domain_label = ((0,2), (0,2))
train_error = 1 - np.mean(train_pred == y)
generalization_error = 1 - np.mean(test_pred == y_test)
try:
(mi, mi_scaled) = compute_mutual_information(test_pred,
y_test,
domain_label
)
except ZeroDivisionError as e:
(mi, mi_scaled) = (np.nan, np.nan)
ireport = {
'model': f'{model_name}{n_components}',
        'learning-error': train_error,  # the error computed above, not the projection model's res
'generalization-error': generalization_error,
'mutual-information': mi,
'mutual-information-normalized': mi_scaled,
'degree-of-freedom': df(model)
}
# plot the model's first two components
plt.figure()
colors = ["red", "green", "blue"]
lw = 2
# if there is only one component
# put 0s into the 2nd component for visualization
if n_components == 1:
X_r = np.concatenate([X_r, X_r], axis=1)
X_r[:,1] = 0
classes = y.unique()
for color, i, target_name in zip(colors, [0, 1, 2], classes):
plt.scatter(X_r[y == i, 0], X_r[y == i, 1],
color=color, alpha=0.8, lw=lw, label=target_name)
plt.legend(loc="upper right", shadow=False, scatterpoints=1)
plt.title(model_name)
return(ireport, model)
Let's compare against PCA.
ireport, m = model_svd(((X, y), (X_test, y_test)) , 2, 'PCA')
report.append(ireport)
We must be careful what PCA minimizes and how!
Let's try extracting 2 and 1 components.
ireport, m = model_svd(((X, y), (X_test, y_test)) , 2, 'LDA')
report.append(ireport)
ireport, m = model_svd(((X, y), (X_test, y_test)) , 1, 'LDA')
report.append(ireport)
Linear discriminant tries to maximize the class difference.
(Figure annotations: the b channel; the y dimension of the 2-component LDA, which is not the same as just taking b.)
Let's try a few commonly used kernels for models of varying complexity.
def classify_svm(train, test, kernel: str, degree: int):
"""Fit a support vector machine with the specified kernel, and return
train and generalization error.
"""
svm = make_pipeline(StandardScaler(), SVC(
gamma='auto', # only useful for the rbf kernel
kernel=kernel,
degree=degree, # only useful for the poly kernel
cache_size=500,
probability=True
))
svm.fit(X, y)
train_error = 1 - svm.score(X, y)
generalization_error = 1 - svm.score(X_test, y_test)
(mi, mi_scaled) = compute_mutual_information(svm.predict(X_test),
y_test,
((0,2), (0,2))
)
df = svm['svc'].n_support_.sum()
# give the model a descriptive name
if kernel == 'poly':
name = f'{kernel}{degree}'
else:
name = f'{kernel}'
ireport = {
'model': f'svm-{name}',
'learning-error': train_error,
'generalization-error': generalization_error,
'mutual-information': mi,
'mutual-information-normalized': mi_scaled,
'degree-of-freedom': df
}
return(ireport, svm)
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
if kernel == 'poly':
for degree in [1, 2, 3]:
[ireport,svm] = classify_svm(train_scaled, test_scaled,
kernel, degree)
report.append(ireport)
else:
# the degree is meaningless for non-polynomial kernels
[ireport,svm] = classify_svm(train_scaled, test_scaled,
kernel, 0)
report.append(ireport)
We have our custom model, LDA and SVM. Let's compare their errors and information metrics.
# report = report[0:4]
report_df = pd.DataFrame(report)
report_df.sort_values(by=['generalization-error', 'mutual-information-normalized'], inplace=True)
report_df['best'] = 3
report_df.loc[report_df['model']=='svm-poly1', 'best'] = 1
report_df.loc[report_df['model']=='projection-optimization', 'best'] = 2
report_df.loc[report_df['model']=='PCA2', 'generalization-error'] = np.nan
report_df['best'] = report_df['best'].astype('category')
display(report_df)
The best model, svm-poly1, is a support vector machine with a degree-1 polynomial kernel (linear plus a constant). But how does it work?
SVM first maps the predictor vectors through a kernel. The best model svm-poly1 applies a degree-1 polynomial kernel, $K(x, x') = (\gamma \, x \cdot x' + r)^1$: an affine transform (a scaling $\gamma$ and a shift $r$) of the plain linear kernel, so the decision boundary remains linear. In addition to tweaking its free parameters to find the best classifier, the SVM also chooses some data points near the classification boundary as support vectors, which are used for making classification decisions.
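A minimal numeric illustration of that kernel (`poly1_kernel` is a hypothetical helper mirroring the standard polynomial-kernel formula, not part of the analysis):

```python
import numpy as np

def poly1_kernel(x, xp, gamma=1.0, coef0=0.0):
    """Degree-1 polynomial kernel: (gamma * <x, x'> + coef0)^1."""
    return (gamma * np.dot(x, xp) + coef0) ** 1

x = np.array([0.2, 0.5, 0.9])
xp = np.array([0.1, 0.4, 0.7])

# with gamma=1 and coef0=0 it reduces to the plain linear kernel
print(poly1_kernel(x, xp))   # same as the dot product
print(np.dot(x, xp))
```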
Compared to the best model svm-poly1, our model projection-optimization
Our model was much less complex, much more interpretable, and tailored to this specific problem.
svm-poly3 was a more complex model, with a non-linear kernel and more free parameters. Although svm-poly3 had less error during learning, it has poorer generalization error (when classifying an unseen dataset). This is a sign of overfitting!
gg1 = (ggplot(report_df) +
aes(x='model', y='mutual-information-normalized',
size='degree-of-freedom',
color='best'
) +
theme(axis_text_x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_y_continuous(limits=[0,1]) +
scale_color_manual(values=['red', 'blue', 'black']) +
geom_point()
)
gg2 = (ggplot(report_df) +
aes(x='model', y='generalization-error',
size='degree-of-freedom',
color='best'
) +
theme(axis_text_x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_y_continuous(limits=[0,1]) +
scale_color_manual(values=['red', 'blue', 'black']) +
geom_point()
)
print(gg1)
print(gg2)
The problem was vastly simplified by the initial dimensionality reduction from image pixel space to just the color of the central pixel. That is idealistic. In general, feature extraction may involve (deep) convolutional neural networks (CNNs), unsupervised learning methods, and custom-designed kernels.
Model training was performed on the full data set (batch learning). We could adopt an on-line learning classifier that takes one sample at a time. This avoids re-training as more data accumulates over time, and is more convenient to provide as an API for others to use.
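As a sketch of how an on-line variant could look: sklearn's `SGDClassifier` supports incremental updates via `partial_fit`. The data here is synthetic and illustrative; the three feature columns merely mirror the reduced predictors used above (rg_diag, b, side), with only the second one made informative.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1, 2])

for _ in range(50):  # a stream of mini-batches
    y_batch = rng.integers(0, 3, size=32)
    # the b-like feature separates the classes, as the analysis found
    b = y_batch / 2 + rng.normal(0, 0.05, size=32)
    X_batch = np.column_stack([rng.random(32), b, rng.random(32)])
    # update the model incrementally; no full retraining needed
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

print(clf.score(X_batch, y_batch))
```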
Also, being able to put a confidence on our prediction, by means of a confidence interval or a point probability for our prediction would be highly valuable to make practical decisions; low confidence predictions should weight less for a business decision and we should rely more on high-confidence predictions.
Most of the models above can naturally be extended to report a confidence for each prediction. Since it was not required for this case study, the topic is not explored further.
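For illustration only (synthetic stand-in data, not the case-study images): a probabilistic classifier reports per-class probabilities, which could weight each prediction in a downstream decision. The single predictor here plays the role of the b channel.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
y = rng.integers(0, 3, size=300)
# class centers at 0, 0.5, 1.0 with a little noise
X = (y / 2 + rng.normal(0, 0.08, size=300)).reshape(-1, 1)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba([[0.5]])       # near the class-1 center
print(proba.round(3))                    # high confidence for class 1
proba_edge = clf.predict_proba([[0.3]])  # nearer a class boundary
print(proba_edge.round(3))               # lower confidence; weight it less
```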
Simply being able to predict is less valuable for making informed decisions in practice than having insight into the causal links implied by the data. Building causal models that enable us to make strong causal predictions will be valuable in practice.
While in general, more elaborate models lead to better prediction performance and better generalization, they also take more resources (time, storage, input data) to run. It is thus important to consider the computational time and space complexity of the algorithms for the given problem. It is also possible and important to consider optimizing the same algorithm, which is usually possible with more insight and mathematical derivations.
Migrating to an on-line algorithm may help, since only one sample is used for training at a time, but we would still need to be careful: if the model architecture itself needs adjustment, even an on-line learning model must re-learn from all the data.
I am using functional style in this demo. Specifically:
This allows:
I am open to restricting myself to any specific set of libraries, such as the standard libraries and those with commercial friendly licenses (like MIT, BSD etc.) or ones developed in-house, as favored by the team. I can quickly pick up new libraries and languages.
Thank you for taking the time to read through this rather lengthy report. Let me know if you have any feedback, suggestions or if you spotted any mistakes, at hormet.yiltiz@gmail.com.